Representing and synthesizing novel views of real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically handle dynamic scenes by applying geometry techniques or exploiting temporal information between adjacent frames, without considering the underlying background distribution of the entire scene or the transmittance along the ray dimension, which limits their performance on static and occluded areas. Our approach, $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields, offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, and is hence called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution of the static background and a 6D-input NeRF to represent the dynamic objects, respectively. Each ray sample is given an additional occlusion weight that indicates how its transmittance is split between the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and on urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
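The mechanism at the heart of this abstract is the per-sample occlusion weight that splits transmittance between the static and dynamic fields. Below is a minimal sketch of how such a weight could blend two radiance fields inside standard volume rendering; the function name, tensor shapes, and the exact blending rule are assumptions for illustration, not the paper's released implementation.

```python
import torch

def blended_volume_render(sigma_s, rgb_s, sigma_d, rgb_d, w, deltas):
    """Composite static and dynamic fields along one ray.

    sigma_s, sigma_d: (N,) densities from the static / dynamic fields
    rgb_s, rgb_d:     (N, 3) colors from the two fields
    w:                (N,) occlusion weight in [0, 1]; w ~ 1 means the
                      sample is explained by the static background
    deltas:           (N,) distances between adjacent samples
    """
    sigma = w * sigma_s + (1.0 - w) * sigma_d               # blended density
    rgb = w[:, None] * rgb_s + (1.0 - w[:, None]) * rgb_d   # blended color
    alpha = 1.0 - torch.exp(-sigma * deltas)                # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                       # accumulated transmittance
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)              # final pixel color
```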
In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap between idealized research and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference-time delay. To mitigate this problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. In line with practical deployment, the Streaming Perception Under constRained-computation (SPUR) evaluation protocol is further constructed, where the 12Hz inputs are used for streaming evaluation under different computational-resource constraints. In the ASAP benchmark, comprehensive experimental results reveal that model rankings change under different constraints, suggesting that model latency and computation budget should be treated as design choices when optimizing for practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.
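One way to picture streaming evaluation is that, at each annotated timestamp, the evaluator can only use the most recent prediction whose computation has already finished. The sketch below illustrates this matching rule under an assumed data layout; it is not the SPUR protocol's actual code.

```python
def streaming_match(pred_stream, query_times):
    """For each ground-truth timestamp, pick the latest prediction whose
    computation has already finished (start + latency <= query time).

    pred_stream: list of (start_time, latency, prediction) sorted by start_time
    query_times: sorted list of annotated timestamps
    """
    matched = []
    for t in query_times:
        available = [p for (s, lat, p) in pred_stream if s + lat <= t]
        matched.append(available[-1] if available else None)  # None = a missed frame
    return matched
```

Under this rule, a slower model is matched against staler predictions, which is why rankings can change once latency and compute budget enter the evaluation.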
Intersections are among the most challenging scenarios for autonomous driving. Due to their complexity and randomness, essential applications at intersections (e.g., behavior modeling, motion prediction, safety validation, etc.) depend heavily on data-driven techniques, so there is great demand for trajectory datasets of traffic participants (TPs) at intersections. At present, most urban intersections are equipped with traffic lights, yet no large-scale, high-quality, publicly available trajectory dataset for signalized intersections exists. In this paper, a typical two-phase signalized intersection in Tianjin, China was therefore selected, and a pipeline was designed to construct a signalized intersection dataset (SIND), which contains 7 hours of recordings covering more than 13,000 TPs of 7 types. Traffic-light violation behaviors in SIND were also recorded, and SIND is compared with other similar works. The features of SIND can be summarized as follows: 1) SIND provides more comprehensive information, including traffic-light states, motion parameters, high-definition (HD) maps, etc.; 2) the categories of TPs are diverse and distinctive, with the proportion of vulnerable road users (VRUs) reaching up to 62.6%; 3) multiple traffic-light violations by non-motorized vehicles are captured. We believe that SIND will be an effective supplement to existing datasets and can promote related research on autonomous driving. The dataset is available online at https://github.com/sotif-avlab/sind.
Multilayer graphs have attracted extensive research in many fields owing to their high utility in modeling interdependent systems. However, clustering of multilayer graphs, which aims to partition the graph nodes into categories or communities, is still in its nascent stage. Existing methods are often limited to exploiting multi-view attributes or multiple networks and ignore more complex and richer network frameworks. To this end, we propose a generic and effective autoencoder framework for multilayer graph clustering named the Multilayer Graph Contrastive Clustering Network (MGCCN). MGCCN consists of three modules: (1) an attention mechanism is applied to better capture the relevance between nodes and their neighbors for better node embeddings; (2) to better explore consistent information across different networks, a contrastive fusion strategy is introduced; (3) MGCCN employs a self-supervised component that iteratively strengthens the node embeddings and clustering. Extensive experiments on different types of real-world graph data demonstrate that our proposed method outperforms state-of-the-art techniques.
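As a rough illustration of module (2), the contrastive fusion step can be read as an InfoNCE-style objective that aligns each node's embeddings across two layers of the multilayer graph. The loss below is a generic sketch under that reading; the paper's exact fusion strategy may differ.

```python
import torch
import torch.nn.functional as F

def cross_layer_contrastive_loss(z1, z2, tau=0.5):
    """InfoNCE-style loss pulling the same node's embeddings from two
    graph layers together and pushing different nodes apart.

    z1, z2: (N, d) node embeddings from two layers of the multilayer graph
    tau:    softmax temperature
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                            # (N, N) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```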
In this paper, we address the problem of identifying anomalies in medical scans, which has attracted increasing attention in recent years due to the expensive pixel-level annotations required from experts and the abundance of unannotated normal and abnormal image scans. We introduce a segmentation network that utilizes adversarial learning to partition an image into two cuts, one of which falls into a user-provided reference distribution. This Adversarial-based Selective Cutting network (ASC-Net) bridges the two domains of cluster-based deep segmentation and adversarial-based anomaly/novelty detection algorithms. Our ASC-Net learns from normal and abnormal medical scans to segment anomalies in medical scans without any masking supervision. We evaluate this unsupervised anomaly segmentation model on three public datasets, i.e., BraTS 2019 for brain tumor segmentation, a liver lesion segmentation dataset, and MS-SEG 2015 for brain lesion segmentation, as well as a private brain tumor segmentation dataset. Compared to existing methods, our model demonstrates tremendous performance gains on unsupervised anomaly segmentation tasks. Although there is still room to further improve performance compared to supervised learning algorithms, the promising experimental results and interesting observations shed light on building unsupervised anomaly identification for medical imaging using user-defined knowledge.
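A minimal sketch of the adversarial selective-cut idea follows: a segmentation network produces a soft mask that splits the scan into two cuts, and a discriminator pushes the "normal" cut toward the user-provided reference distribution. `seg_net` and `disc` are hypothetical placeholder modules, and the loss form is an assumption rather than ASC-Net's published objective.

```python
import torch

def selective_cut_loss(image, seg_net, disc):
    """One generator step of an adversarial selective cut (sketch).

    image:   (B, 1, H, W) medical scan
    seg_net: module mapping the image to per-pixel logits
    disc:    discriminator trained on reference scans, outputs prob in (0, 1)
    """
    mask = torch.sigmoid(seg_net(image))     # soft assignment per pixel
    normal_cut = image * mask                # should resemble the reference scans
    anomaly_cut = image * (1.0 - mask)       # the residual is the anomaly cut
    adv = -torch.log(disc(normal_cut) + 1e-8).mean()  # fool the discriminator
    return adv, anomaly_cut                  # anomaly_cut yields the segmentation
```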
The transformer architecture, which has recently witnessed booming applications in vision tasks, marks a pivot away from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers are capable of extracting their pairwise relationships using self-attention. Yet while tokenization is the stem building block of transformers, what makes a good tokenizer has not been well understood in computer vision. In this work, we investigate this uncharted problem from an information trade-off perspective. In addition to unifying and understanding existing structural modifications, our derivation leads to better design strategies for vision tokenizers. The proposed Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Furthermore, a regularization objective, TokenProp, is embraced in the standard training regime. Through extensive experiments on various transformer architectures, we observe both improved performance and intriguing properties of these two plug-and-play designs, with negligible computational overhead. These observations further indicate the importance of the commonly omitted design of tokenizers in vision transformers.
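To make the normalization-based inter-token modulation concrete, the layer below computes statistics across the token axis (rather than the channel axis) with learnable per-token affine parameters, so each token is rescaled relative to the others. The class name, placement, and parameter shapes are illustrative guesses, not MoTo's published design.

```python
import torch
import torch.nn as nn

class InterTokenNorm(nn.Module):
    """Sketch of inter-token modulation via normalization: statistics are
    shared across tokens, coupling every token's scale to the rest of the
    sequence. A hypothetical layer, not the paper's exact module.
    """
    def __init__(self, num_tokens, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_tokens, 1))   # per-token scale
        self.beta = nn.Parameter(torch.zeros(num_tokens, 1))   # per-token shift
        self.eps = eps

    def forward(self, x):                  # x: (batch, tokens, channels)
        mu = x.mean(dim=1, keepdim=True)   # statistics over the token axis
        var = x.var(dim=1, keepdim=True, unbiased=False)
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta
```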
Lobster eye telescopes are ideal monitors to detect X-ray transients because they can observe celestial objects over a wide field of view in the X-ray band. However, images obtained by lobster eye telescopes are modified by their unique point spread functions, making it hard to design a high-efficiency target detection algorithm. In this paper, we integrate several machine learning algorithms to build a target detection framework for data obtained by lobster eye telescopes. Our framework first generates two 2D images with different pixel scales according to the positions of photons on the detector. An algorithm based on morphological operations and two neural networks is then used to detect candidate celestial objects of different flux from these 2D images. Finally, a random forest algorithm picks out the final detection results from the candidates obtained in the previous steps. Tested with simulated data of the Wide-field X-ray Telescope onboard the Einstein Probe, our detection framework achieves over 94% purity and over 90% completeness for targets with flux greater than 3 mCrab ($9.6 \times 10^{-11}$ erg/cm$^2$/s), and more than 94% purity with moderate completeness for targets with lower flux, at acceptable time cost. The framework proposed in this paper can serve as a reference for data processing methods developed for other lobster eye X-ray telescopes.
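The first stages of such a pipeline, binning photon positions into 2D images at different pixel scales and extracting bright connected regions with morphological operations, can be sketched as follows. Thresholds, scales, and function names are illustrative assumptions; the actual framework adds two neural networks and a random forest on top of this stage.

```python
import numpy as np
from scipy import ndimage

def photons_to_candidates(x, y, det_size=1024, scales=(1, 4), min_pix=4):
    """Bin photon detector positions into 2D images at two pixel scales and
    extract connected bright regions as source candidates (sketch).

    x, y: 1D arrays of photon positions on the detector, in [0, det_size)
    """
    candidates = []
    for s in scales:
        bins = det_size // s
        img, _, _ = np.histogram2d(x, y, bins=bins, range=[[0, det_size]] * 2)
        mask = img > img.mean() + 3 * img.std()       # simple significance cut
        mask = ndimage.binary_opening(mask)           # morphological cleaning
        labels, n = ndimage.label(mask)
        for i in range(1, n + 1):
            blob = labels == i
            if blob.sum() >= min_pix:
                cx_b, cy_b = ndimage.center_of_mass(blob)  # histogram2d puts x on axis 0
                candidates.append((s, cx_b * s, cy_b * s))  # back to detector coords
    return candidates
```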
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although using diverse, high-quality object instances in Copy-Paste brings larger performance gains, previous works obtain object instances either from human-annotated instance segmentation datasets or by rendering 3D object models, and both approaches are too expensive to scale up for good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To make this work, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves gains of +2.6 box AP and +2.1 mask AP over all classes, and even larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
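The core Copy-Paste operation itself is simple enough to sketch: given a segmented instance (e.g., generated by a text2image model and filtered by a zero-shot recognizer), its masked pixels are written onto a background image. This is a generic illustration, not X-Paste's full pipeline, which also handles scaling, blending, and label generation.

```python
import numpy as np

def copy_paste(background, instance, mask, top, left):
    """Paste one segmented object instance onto a background image.

    background: (H, W, 3) uint8 training image, modified in place
    instance:   (h, w, 3) object crop
    mask:       (h, w) binary foreground mask for the crop
    top, left:  paste position (must keep the crop inside the image)
    """
    h, w = mask.shape
    region = background[top:top + h, left:left + w]   # view into the background
    fg = mask.astype(bool)
    region[fg] = instance[fg]                         # overwrite foreground pixels
    return background  # the pasted mask also becomes a new instance label
```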
Generative adversarial networks (GANs) have achieved great success in image inpainting yet still have difficulty tackling large missing regions. In contrast, iterative algorithms, such as autoregressive and denoising diffusion models, must be deployed with massive computing resources to achieve decent results. To overcome their respective limitations, we present a novel spatial diffusion model (SDM) that uses a few iterations to gradually deliver informative pixels to the entire image, largely enhancing inference efficiency. Moreover, thanks to the proposed decoupled probabilistic modeling and spatial diffusion scheme, our method achieves high-quality large-hole completion. On multiple benchmarks, we achieve new state-of-the-art performance. Code is released at https://github.com/fenglinglwb/SDM.
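One plausible reading of "gradually delivering informative pixels" is an iterative loop in which the model predicts the full image together with a per-pixel confidence, and the most confident hole pixels are committed at each step. The sketch below encodes that reading; the model interface and the commit rule are assumptions, not SDM's actual scheme.

```python
import torch

@torch.no_grad()
def spatial_diffusion_inpaint(model, image, mask, steps=8, keep_ratio=0.25):
    """Iteratively propagate pixels into the hole (sketch).

    model: hypothetical module mapping (image, mask) to a full-image
           prediction (B, 3, H, W) and a confidence map (B, 1, H, W)
    mask:  (B, 1, H, W), 1 marks unknown (hole) pixels
    """
    for _ in range(steps):
        pred, conf = model(image, mask)
        hole_conf = conf * mask                        # only rank unknown pixels
        k = max(1, int(keep_ratio * int(mask.sum())))
        thresh = hole_conf.flatten().topk(k).values.min()
        commit = (hole_conf >= thresh).float() * mask  # most confident hole pixels
        image = image * (1 - commit) + pred * commit   # deliver those pixels
        mask = mask * (1 - commit)                     # shrink the hole
        if mask.sum() == 0:
            break
    return image
```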
Time series forecasting is a long-standing challenge because real-world information arises in diverse scenarios (e.g., energy, weather, traffic, economics, earthquake warning). However, the forecasts of some mainstream models deviate dramatically from the ground truth. We believe the reason is that these models lack the ability to capture the frequency information that real-world datasets richly contain. At present, mainstream frequency-extraction methods are based on the Fourier transform (FT). However, using the FT is problematic due to the Gibbs phenomenon: if the values at the two ends of a sequence differ significantly, oscillatory approximations appear around the ends and high-frequency noise is introduced. We therefore propose a novel frequency enhanced channel attention that adaptively models frequency interdependencies between channels based on the Discrete Cosine Transform, which intrinsically avoids the high-frequency noise caused by the problematic periodicity assumed in the Fourier transform (i.e., the Gibbs phenomenon). We show that this network generalizes extremely effectively across six real-world datasets and achieves state-of-the-art performance. We further demonstrate that the frequency enhanced channel attention module can be flexibly applied to different networks: it improves the prediction ability of existing mainstream networks, reducing MSE by 35.99% on LSTM, 10.01% on Reformer, 8.71% on Informer, 8.29% on Autoformer, 8.06% on Transformer, etc., at a slight computational cost and with just a few lines of code. Our code and data are available at https://github.com/Zero-coder/FECAM.
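As an illustration of DCT-based channel attention, the module below summarizes each channel by its DCT spectrum energy and gates channels with a small MLP, thereby avoiding the periodic-extension artifacts of the FFT. Layer sizes and the exact channel descriptor are assumptions; consult the linked repository for the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k, column t holds cos(pi/n * (t+0.5) * k)."""
    k = torch.arange(n).float()
    basis = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None])
    basis[0] *= 1.0 / math.sqrt(2)
    return basis * math.sqrt(2.0 / n)

class FrequencyChannelAttention(nn.Module):
    """Sketch of frequency enhanced channel attention for inputs of shape
    (batch, channels, length). A hypothetical layout, not the paper's code.
    """
    def __init__(self, channels, length, reduction=4):
        super().__init__()
        self.register_buffer("dct", dct_matrix(length))        # (L, L)
        hidden = max(1, channels // reduction)
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, length)
        freq = x @ self.dct.t()           # DCT-II along the time axis
        energy = freq.pow(2).mean(dim=2)  # (batch, channels) spectral energy
        gate = self.mlp(energy)           # model channel interdependencies
        return x * gate.unsqueeze(-1)     # reweight each channel
```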